Multi-modal and grounded models
As we move further toward language understanding, it is essential to ask what meaning really is. Words, as manipulated by a computer, are mere symbols. How can we know what they stand for in reality, if they stand for anything at all? Developing grounded NLP systems is therefore a must for true understanding. Although what makes a symbolic system grounded is controversial, moving from uni-modal to multi-modal systems is clearly a step forward, because it gives a model access to more sources of meaning. Researchers have tried different methods to harvest meaning from text, images, sound and video. This growing body of research remains to be surveyed here.

With regard to the purpose of the models, there is research that uses images to add meaning to words [Bruni, E., Tran, G., & Baroni, M. (2011). Distributional semantics from text and images. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics. Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=2140493; Bruni, E., Uijlings, J., & Baroni, M. (2012). Distributional Semantics with Eyes: Using Image Analysis to Improve Computational Representations of Word Meaning.] and research that uses text to help image search [Socher, R., Le, Q. V, Manning, C. D., & Ng, A. Y. (2013). Grounded Compositional Semantics for Finding and Describing Images with Sentences.].

TODO: notable visual-textual model: Chrupała et al. (2015).
TODO: incorporating relational knowledge: Xu, C., Bai, Y., Bian, J., Gao, B., Wang, G., Liu, X., & Liu, T.-Y. (2014). RC-NET: A General Framework for Incorporating Knowledge into Word Representations. The 23rd ACM International Conference on Information and Knowledge Management (CIKM 2014), Shanghai, China.

Summary and Comments on: Distributional semantics in technicolor
Bruni, E., Boleda, G., Baroni, M., & Tran, N. (2012). Distributional semantics in technicolor.
· Build a perceptually grounded computational model of word meaning.
· Perception = vision.
· Compare textual distributional models to multimodal ones.
· Results: image DS models
  o are worse than text DS models on general tasks (semantic relatedness),
  o are as good as or better than text DS models at modelling words with visual correlates, such as color terms (in both literal and nonliteral uses),
  o model different aspects of meaning than text DS models.
· Connect language and perception, that is, exploit visual information to build better models of word meaning:
  o Compare text DS, image DS, combined and hybrid models,
  o Evaluate them on general semantic relatedness tasks and on specific tasks where visual information might be relevant.

Textual models
· Four different text DS models were built: Window2, Window20, Document, Distributional Memory (DM).
· General set-up:
  o Corpus used: ukWaC + Wackypedia (1.9b and 0.82b tokens, respectively),
  o Weighting: Local Mutual Information (LMI), a measure of the mutual dependence of two variables; it considers the xxx and, contrary to PMI, counteracts the tendency to favor rare events by multiplying PMI by the raw counts,
  o Same set of target words: the 20k most frequent nouns, 5k most frequent adjectives and 5k most frequent verbs in the combined corpora (30k target words),
  o Collocates: the same 30k words.
· Win2 and Win20: count co-occurrences of targets and collocates within a window of fixed width. Co-occurrence matrix: target x 30k collocates.
· Document: target words are represented as distributions over documents (so the co-occurrence matrix is replaced by a target x 30k document matrix).
· Distributional Memory (DM): co-occurrences are based on lexico-syntactic and dependency relations.

Visual models
· Corpus: ESP-game dataset, 100k labeled images. 20,515 distinct tags with an average of 4 tags per image (note: the label of an image is its set of tags).
· A vector of visual features is built for each tag in the dataset.
· Bag of visual words model (BoVW):
  1. Definition of the VWs,
  2. Encoding of each image as a distribution over VWs,
  3. Construction of tag-level VW vectors by aggregating the vectors of all images labelled with the same tag.
· Definition of the VWs (vocabulary building):
  1. From each image, extract descriptor features from relevant areas (?),
  2. Cluster the descriptor features (k-means),
  3. The cluster centroids are the VWs.
· Encoding of images (~quantization):
  1. Each descriptor of a new image is assigned to the nearest VW,
  2. The image is therefore represented as a vector counting how many times each VW from the vocabulary appears in it.
· Tag vectors:
  1. Every image in the dataset (100k) is first represented as a vector of counts over the VW vocabulary,
  2. Tag representations are obtained by summing the occurrences of each VW across all images tagged with that tag,
  3. Raw counts are transformed into LMI scores (tag ~= target, VW ~= collocate).
· What features for the descriptors?
  o Scale Invariant Feature Transform (SIFT):
    § Invariant to image scale and rotation; also incorporates color information (RGB scale),
    § K-means clustering (k = 500, 1000, 1500, 2000, 2500),
    § Each image is partitioned into a 4 x 4 grid,
    § Each image is therefore finally encoded as a k x 16 VW count vector (one k-dimensional count vector per grid cell).
  o LAB:
    § Based on 3 dimensions (1 for brightness and 2 for color: red-green and yellow-blue),
    § Sampled for each pixel,
    § K-means (with k between 128 and 1,024).

Multimodal models
· Assemble the text DS and image DS vectors for the same target/tag word.
· Concatenation by a linear weighted combination function (more sophisticated methods are explored later, but stil…).
· Weights estimated from MEN data (2,000 word pairs).
· An implementation of the method can be found here: https://github.com/s2m/FUSE

Hybrid models
· Based solely on the co-occurrence of tags in the same images.
· ESP-Win: tags are represented by their co-occurrences with other tags in the image label.
· ESP-Doc: tags are represented by their co-occurrence with images, each image being treated as a document (and therefore as a dimension, as in the Document text DS model).

Testing: general semantic models
· Test the correlation between each model and human judgments of word similarity and relatedness.
· Human judgment sources: WordSim353 and MEN.
  o WordSim353: similarity ratings for 353 word pairs, from the judgments of 16 subjects.
  o MEN: 3,000 pairs of words that occur as ESP-game dataset tags, with relatedness ratings collected through Amazon Mechanical Turk (http://clic.cimec.it/~elia.bruni/MEN.html).
· Measure: Spearman correlation between model-based pair similarities and human similarity judgments.

Experiment 1
· Hypothesis: for models that are sensitive to visual information, the distance between an object vector and a color vector is smallest for pairs of an object and its typical color.
· Method:
  o Select 52 nouns (concrete, present in all models, and occurring more than 100 times with color terms),
  o Select 11 basic color words (black, blue, brown, green, grey, orange, etc.),
  o Measure the cosine between each noun and the 11 color words,
  o Rank the color words by similarity and record the position of the correct color word (the most typical one according to the authors).
· Results report, for each model, the median rank of the correct color word and the number of times the model ranks the correct color first.

Experiment 2: distinguishing between literal and nonliteral uses of color terms
· Data: phrases labeled as literal or nonliteral by the authors; about 227 literal and 115 nonliteral.
· Prediction: higher similarity between noun and color word for literal pairs than for nonliteral pairs.
· Similarity computed as cosine.
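
Worked sketch: LMI weighting. The Local Mutual Information weighting mentioned under Textual models (and reused for the tag vectors) can be illustrated on a toy count matrix. This is only a minimal sketch: the function name `lmi_matrix`, the natural logarithm, the treatment of zero counts as score 0 and the toy counts are all assumptions made for illustration, not details taken from the paper. The formula used is LMI(t, c) = count(t, c) * log(count(t, c) * N / (count(t) * count(c))), that is, PMI multiplied by the raw count.

```python
import math

def lmi_matrix(counts):
    """Convert a target-by-collocate count matrix into LMI scores.

    LMI(t, c) = count(t, c) * PMI(t, c), where
    PMI(t, c) = log(count(t, c) * N / (count(t) * count(c))).
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    scores = []
    for i, row in enumerate(counts):
        out = []
        for j, n in enumerate(row):
            if n == 0:
                # Assumption: unseen pairs get a score of 0.
                out.append(0.0)
            else:
                pmi = math.log(n * total / (row_sums[i] * col_sums[j]))
                out.append(n * pmi)
        scores.append(out)
    return scores

# Invented toy counts: rows are targets, columns are collocates.
toy_counts = [[10, 0],
              [0, 10]]
toy_lmi = lmi_matrix(toy_counts)
```

Because the PMI term is multiplied by the raw count, a pair seen only once cannot reach a large score simply because both words are rare, which is the behaviour the notes attribute to LMI.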
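
Worked sketch: bag of visual words. The three BoVW steps listed under Visual models (vocabulary building, image encoding, tag-level aggregation) can be followed on toy data. This is a hedged, dependency-free illustration: the 2-D "descriptors", the tags, the helper names and the naive k-means seeding (first k points) are all invented for the example, and the 4 x 4 spatial grid and the final LMI transformation are left out. Real descriptors would be SIFT or LAB features.

```python
from collections import defaultdict

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; seeding with the first k points is a simplification."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids

def encode(descriptors, vocabulary):
    """Quantization: count how often each visual word is the nearest centroid."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        hist[nearest(d, vocabulary)] += 1
    return hist

def tag_vectors(images, vocabulary):
    """Sum the visual-word histograms of all images that carry each tag."""
    vectors = defaultdict(lambda: [0] * len(vocabulary))
    for descriptors, tags in images:
        hist = encode(descriptors, vocabulary)
        for tag in tags:
            vectors[tag] = [a + b for a, b in zip(vectors[tag], hist)]
    return dict(vectors)

# Invented data: each image is (list of descriptors, set of tags).
images = [
    ([(0.0, 0.1), (5.0, 5.1), (0.1, 0.0)], {"sky", "sea"}),
    ([(5.1, 5.0), (4.9, 5.0)], {"grass"}),
    ([(0.0, 0.0), (0.1, 0.1)], {"sky"}),
]
all_descriptors = [d for descriptors, _ in images for d in descriptors]
vocabulary = kmeans(all_descriptors, k=2)  # step 1: visual words
tags = tag_vectors(images, vocabulary)     # steps 2 and 3
```

In the set-up described in the notes, the resulting raw tag-by-VW counts would then be reweighted with LMI, with tags playing the role of targets and visual words the role of collocates.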
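
Worked sketch: weighted concatenation. The linear weighted combination mentioned under Multimodal models could look roughly like the following. This is an assumption-laden sketch, not the FUSE implementation: the function names, the unit-length normalization of each channel before weighting, and the single mixing weight `alpha` are illustrative choices. In the notes, the weights are estimated on MEN data; here `alpha` is simply passed in.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def combine(text_vec, image_vec, alpha=0.5):
    """Concatenate alpha * text channel with (1 - alpha) * image channel.

    Assumption: each channel is length-normalized first, so that alpha
    alone controls the relative contribution of the two modalities.
    """
    text_part = [alpha * x for x in normalize(text_vec)]
    image_part = [(1 - alpha) * x for x in normalize(image_vec)]
    return text_part + image_part

# Invented toy vectors for one word.
multimodal = combine([3.0, 4.0], [0.0, 2.0], alpha=0.5)
text_only = combine([3.0, 4.0], [0.0, 2.0], alpha=1.0)
```

With `alpha=1.0` the image channel is zeroed out and the model reduces to the text model; with `alpha=0.0` it reduces to the image model.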
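
Worked sketch: correlation with human judgments. The evaluation described under "Testing: general semantic models" compares model cosines with human ratings through Spearman correlation. The sketch below computes Spearman's rho as the Pearson correlation of average ranks, using only the standard library. The word vectors and ratings are invented for illustration and are not taken from WordSim353 or MEN.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented toy model and toy human ratings.
vectors = {"car": [1.0, 0.0], "automobile": [0.9, 0.1],
           "tree": [0.0, 1.0], "forest": [0.3, 0.7]}
human = [("car", "automobile", 9.0), ("tree", "forest", 8.0), ("car", "tree", 2.0)]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in human]
human_scores = [rating for _, _, rating in human]
rho = spearman(model_scores, human_scores)
```

Spearman correlation depends only on the ordering of the pairs, so a model is not penalized for producing cosines on a different scale than the human ratings.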
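
Worked sketch: Experiment 1 ranking. The method of Experiment 1 (rank the color words by cosine to each noun, record the position of the typical color, then report the median rank and the number of first places) can be sketched as follows. The three-dimensional vectors, the three color words and the noun/color pairs are invented for illustration; the experiment itself uses 52 nouns and 11 color words.

```python
import math
from statistics import median

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_of_correct(noun_vec, color_vecs, correct):
    """Position (1 = best) of the correct color when colors are sorted by cosine."""
    ranked = sorted(color_vecs,
                    key=lambda color: cosine(noun_vec, color_vecs[color]),
                    reverse=True)
    return ranked.index(correct) + 1

# Invented toy vectors.
colors = {"green": [0.1, 0.9, 0.1],
          "red": [0.9, 0.1, 0.1],
          "blue": [0.1, 0.1, 0.9]}
# Each noun maps to (vector, typical color); both are made up.
nouns = {"grass": ([0.2, 0.8, 0.1], "green"),
         "tomato": ([0.8, 0.3, 0.1], "red"),
         "sea": ([0.2, 0.6, 0.5], "blue")}

ranks = {noun: rank_of_correct(vec, colors, typical)
         for noun, (vec, typical) in nouns.items()}
median_rank = median(ranks.values())
ranked_first = sum(1 for r in ranks.values() if r == 1)
```

A lower median rank and a higher first-place count indicate that a model places nouns closer to their typical colors.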